System and method for maintaining prefetch stride continuity through the use of prefetch bits

ABSTRACT

A processor includes a cache that has a plurality of lines to store data. The processor also includes prefetch bits, each of which is associated with one of the cache lines. The processor further includes a prefetch manager that calculates prefetch data as if a cache miss occurred whenever a cache request results in a cache hit to a cache line that is associated with a prefetch bit that is set. In a further embodiment, the prefetch manager prefetches data into the cache based on the distance between cache misses for an instruction.

FIELD OF THE INVENTION

[0001] Embodiments of the present invention relate to prefetching data from a memory. In particular, the present invention relates to methods and apparatus for prefetching data from a memory for use by a processor.

BACKGROUND

[0002] Instructions executed by a processor often use data that may be stored in a system memory device such as a Random Access Memory (RAM). For example, a processor may execute a LOAD instruction to load a register with data that is stored at a particular memory address. In many systems, because the access time for the system memory is relatively slow, frequently used data elements are copied from the system memory into a faster memory device called a cache and, if possible, the processor uses the copy of the data element in the cache when it needs to access (i.e., read from or write to) that data element. If the memory location that is accessed by an instruction has not been copied into a cache, then the access to the memory location by the instruction is said to cause a “cache miss” because the data needed could not be obtained from the cache. Computer systems operate more efficiently if the number of cache misses is minimized.

[0003] One way to decrease the time spent waiting to access a RAM is to “prefetch” data from the system memory before it is needed and, thus, before the cache miss occurs. Many processors have an instruction cycle in which instructions to be executed are obtained from memory in one step (i.e., an instruction fetch) and executed in another step. If the instruction to be executed accesses a memory location (e.g., a memory LOAD), then the data at that location must be fetched into the appropriate section of the processor from a cache or, if a cache miss, from a system memory. A cache prefetcher attempts to anticipate which data addresses will be accessed by instructions in the future and to prefetch this data from the memory before the data is needed. A cache prefetcher typically determines and maintains a data access pattern for an instruction and prefetches data into the cache based on this data access pattern. As used herein, “instruction” refers to a particular instance of an instruction in the program, with each instruction being identified by a different instruction pointer (“IP”) value.

[0004] The performance of a cache prefetching scheme degrades if the data access pattern is not properly managed. A prefetcher maintains access pattern “continuity” if the prefetcher maintains a discovered access pattern as long as the pattern is active and relinquishes an access pattern that is no longer active. A prefetcher operates less efficiently if the continuity of the access patterns is not maintained.

DESCRIPTION OF THE DRAWINGS

[0005] FIG. 1 is a partial block diagram of a computer system having a processor that maintains prefetch stride continuity through the use of prefetch bits according to an embodiment of the present invention.

[0006] FIG. 2 is a partial block diagram of a cache having prefetch bits according to an embodiment of the present invention.

[0007] FIG. 3 is a flow diagram of a method of maintaining prefetch stride continuity through the use of prefetch bits according to an embodiment of the present invention.

[0008] FIG. 4 is a partial block diagram of a computer system having prefetch bits according to another embodiment of the present invention.

DETAILED DESCRIPTION

[0009] Embodiments of the present invention relate to a prefetcher which prefetches data for an instruction based on an access pattern that has been determined and maintained for the instruction. In one embodiment, the access pattern used is the distance between cache misses caused by the instruction. This distance is the stride for the cache misses and may be referred to as the “miss distance” for the instruction. The miss distance may be stored in a prefetch table.
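By way of illustration only, the miss distance bookkeeping described above might be sketched as follows. The names PrefetchEntry, PrefetchTable, and record_miss are hypothetical and do not appear in the figures, and the sketch assumes that a prefetch address is produced as soon as two misses have been seen for an instruction, without any confirmation step.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PrefetchEntry:
    """Hypothetical prefetch-table entry for one instruction pointer (IP)."""
    last_miss_addr: int
    miss_distance: Optional[int] = None  # unknown until two misses are seen


class PrefetchTable:
    """Hypothetical table mapping an IP to its most recent miss information."""

    def __init__(self) -> None:
        self.entries: Dict[int, PrefetchEntry] = {}

    def record_miss(self, ip: int, addr: int) -> Optional[int]:
        """Record a miss for `ip`; return the next address to prefetch, if known."""
        entry = self.entries.get(ip)
        if entry is None:
            # First miss for this instruction: no distance can be computed yet.
            self.entries[ip] = PrefetchEntry(last_miss_addr=addr)
            return None
        # The miss distance is the stride between consecutive misses.
        entry.miss_distance = addr - entry.last_miss_addr
        entry.last_miss_addr = addr
        return addr + entry.miss_distance
```

In this sketch, two misses at addresses 100 and 105 for the same IP yield a miss distance of 5 and a prefetch address of 110.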

[0010] A prefetcher may incorrectly determine that an access pattern has been dropped when in fact the access pattern is still active. This situation may occur, for example, when the prefetched data is stored in a storage medium, such as a cache, that is not controlled by the prefetcher. If the prefetcher is using a pattern of cache misses as the access pattern, data prefetched into the cache could falsely interrupt the access pattern detected and result in a loss of stride continuity because the prefetched data may cause a cache hit even though this data would have caused a cache miss if it had not been prefetched. That is, the act of prefetching data into the cache causes requests that would have resulted in cache misses to instead result in a cache hit. Thus, while the ultimate object of prefetching is to decrease the number of cache misses, a prefetcher that relies on a pattern of cache misses may disrupt the detected access pattern by the very act of prefetching data into the cache.

[0011] In order to maintain stride continuity, the invention disclosed in this application handles a cache request that results in a cache hit to prefetched data as if a cache miss has occurred. Such a miss may be referred to as a “virtual miss” because an actual cache miss (“actual miss”) did not occur. In an embodiment, a plurality of prefetch bits (or “virtual bits”) are each associated with a line in the cache and are used to indicate that the data stored in the associated line was prefetched into the cache. In an embodiment, the prefetcher will calculate the miss pattern and other prefetch data as though a cache miss occurred if a cache request results in a cache hit to a line that is associated with a set prefetch bit. Thus, the prefetch bits store information that is used by a prefetch manager to determine if data was prefetched into the cache.

[0012] Other embodiments use the prefetch bits for purposes in addition to maintaining the continuity of the prefetch access pattern. For example, in an embodiment a prefetch bit may be reset after the first hit to a prefetched cache line (i.e., the first hit occurring after the line is updated with prefetched data) in order to prevent the calculation of a miss distance of zero for instructions with a stride that is smaller than the size of a cache line. In a further embodiment, prefetch bits are also used to prevent cache pollution.

[0013] FIG. 1 is a partial block diagram of a computer system having a processor that maintains prefetch stride continuity through the use of prefetch bits according to an embodiment of the present invention. Computer system 100 includes a processor 101 that has a decoder 110 that is coupled to a prefetcher 120. Computer system 100 also has an execution unit 107 that is coupled to decoder 110 and prefetcher 120. The term “coupled” encompasses a direct connection, an indirect connection, an indirect communication, etc. Processor 101 may be any micro-processor capable of processing instructions, such as, for example, a general purpose processor in the INTEL PENTIUM family of processors. Execution unit 107 is a device which performs instructions. Decoder 110 may be a device or program that changes one type of code into another type of code that may be executed. For example, decoder 110 may decode a LOAD instruction that is part of a program, and the decoded LOAD instruction may later be executed by execution unit 107. Processor 101 is coupled to Random Access Memory (RAM) 140. RAM 140 is a system memory. In other embodiments, a type of system memory other than a RAM may be used in computer system 100 instead of or in addition to RAM 140.

[0014] In the embodiment shown in FIG. 1, processor 101 contains a cache 130 that is coupled to execution unit 107, prefetcher 120, and RAM 140. In another embodiment, cache 130 may be located outside of processor 101. Cache 130 may be a Static Random Access Memory (SRAM). In an embodiment, cache 130 contains prefetch bits 135. In a further embodiment, cache 130 contains a plurality of lines to store data and each prefetch bit is associated with one of the cache lines. Further details of prefetch bits are discussed below with reference to other figures.

[0015] As shown in FIG. 1, prefetcher 120 includes a prefetch manager 122 and a prefetch memory 125. Prefetch manager 122 may include logic to prefetch data for an instruction based on the distance between cache misses caused by the instruction. As used in this application, “logic” may include hardware logic, such as circuits that are wired to perform operations, or program logic, such as firmware that performs operations. Prefetch memory 125 may store a prefetch table 126 that contains entries including the distance between cache misses caused by an instruction. In an embodiment, prefetch memory 125 is a content addressable memory (CAM). Prefetch manager 122 may determine the addresses of data elements to be prefetched based on the miss distance that is recorded for instructions in the prefetch table.

[0016] FIG. 2 is a partial block diagram of a cache having prefetch bits 135 according to an embodiment of the present invention. FIG. 2 shows cache 130 including a data array 240 and a least recently used (LRU) array 250. As shown in FIG. 2, data array 240 contains a plurality of cache lines 245. Each cache line may be, for example, 32 bytes long. In an embodiment, data array 240 may be organized into sets and ways, as per conventional techniques, and cache 130 may contain other arrays such as, for example, a tag array. As would be appreciated by a person of skill in the art, LRU array 250 may contain recency of use information that is used, for example, to determine cache lines to be evicted when a portion of the data array 240 becomes full. In an embodiment of the present invention, prefetch bits 135 are stored as part of the LRU array 250. In this embodiment, LRU array 250 contains a prefetch bit for each cache line 245 in data array 240. In this embodiment, each cache line 245 is associated with one of the prefetch bits 135. In another embodiment, the prefetch bits 135 may be located in a part of the cache 130 other than LRU array 250.

[0017] FIG. 2 shows a cache manager 260 that is coupled to data array 240 and LRU array 250. In the embodiment shown, cache manager 260 contains prefetch bit management logic 261 and recency of use logic 262. In an embodiment, the prefetch bit management logic 261 manages the values stored in the prefetch bits 135. For example, the prefetch bit management logic 261 may set a prefetch bit each time that a cache line is updated with data that was prefetched into the cache. In an embodiment, the prefetcher 120 sends a signal to prefetch bit management logic 261 whenever the data loaded into the cache is prefetched data. In a further embodiment, prefetch bit management logic 261 resets a prefetch bit in response to a read from a data array line associated with the prefetch bit. Recency of use logic 262 may store recency of use information in LRU array 250, which information is associated with each data array line. In an embodiment, the recency of use logic 262 stores information indicating that a data array line has a status of least recently used whenever the data array line is updated with data that was prefetched into the cache. In a further embodiment, the recency of use logic 262 stores information indicating that the data array line last read has a status of most recently used unless the data array line is associated with a prefetch bit that indicates data being stored in this data array line was prefetched into the cache.

[0018] In an embodiment, a set prefetch bit may indicate that the associated data array line contains prefetched data. As shown in FIG. 2, two of the prefetch bits shown are set (i.e., they have a value of “1”) and five of the prefetch bits shown are not set (i.e., they have a value of “0”). If data is loaded into the cache in response to a cache miss, this data would not have been prefetched and thus the associated prefetch bit may indicate that the data was not prefetched. Of course, any value may be used to indicate that the prefetch bit is set. In an embodiment that is discussed in more detail below, the prefetch bit may be cleared the first time that the prefetched data is read, even though the associated cache line will still contain prefetched data, to handle the case where more than one miss occurs for an instruction in the same cache line.
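As a minimal sketch of this data structure, the prefetch bits may be modeled as one bit per cache line that is written whenever the line is filled. The class name PrefetchBits and the method on_fill are hypothetical, and the choice of which two of the seven lines hold prefetched data is arbitrary.

```python
class PrefetchBits:
    """Hypothetical model: one prefetch bit per cache line."""

    def __init__(self, num_lines: int) -> None:
        self.bits = [0] * num_lines  # 0 = not prefetched, 1 = prefetched

    def on_fill(self, line_index: int, prefetched: bool) -> None:
        # Set the bit for a prefetch fill, clear it for a fill caused by a miss.
        self.bits[line_index] = 1 if prefetched else 0

    def is_set(self, line_index: int) -> bool:
        return self.bits[line_index] == 1


# Seven lines, two of which are filled with prefetched data.
bits = PrefetchBits(7)
bits.on_fill(1, prefetched=True)
bits.on_fill(4, prefetched=True)
bits.on_fill(2, prefetched=False)  # demand fill after an actual miss
```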

[0019] An example of the operation of the present invention is described with reference to FIG. 3. FIG. 3 is a flow diagram of a method of maintaining prefetch stride continuity through the use of prefetch bits according to an embodiment of the present invention. The method shown in FIG. 3 may be used with a system such as that shown in FIGS. 1-2. The processor 101 may be executing a program that contains instructions. As shown in FIG. 3, decoder 110 may decode an instruction (301). This instruction may be, for example, a LOAD instruction that has an IP of XXXX. The LOAD instruction may load data from a location in RAM 140, for example the data element at address YYY. In the example shown in FIG. 3, the instruction decoded has been executed a number of times in the past. This allowed prefetcher 120 to determine an access pattern for the instruction (information on which is stored in prefetch table 126) and to prefetch the next data element to be loaded from RAM according to the access pattern. Thus, in this example the data at address YYY has already been prefetched into a line of cache 130. Because this data was prefetched from the RAM into a line of cache 130, at the time the data was prefetched a prefetch bit associated with the cache line was set by prefetch bit management logic 261.

[0020] According to the example shown in FIG. 3, after decoding the instruction the decoder 110 may cause a request to be sent to cache 130 for a data element (e.g., the data stored at address YYY) that is to be used by the instruction (302). Prefetcher 120 will receive information about the response to the cache request and will determine whether the request resulted in a cache hit (303). In this example, the request would have resulted in an actual miss if the data had not been prefetched into the cache. If the request resulted in a cache miss, prefetcher 120 calculates prefetch information for the instruction based on the request having resulted in a cache miss (304). If the request resulted in a cache hit, the prefetcher 120 obtains information for the prefetch bit associated with the cache line that contains the data element requested (305). The prefetcher 120 then determines if the information indicates that the data element was prefetched into the cache (306). If the information indicates that the data element had been prefetched into the cache, the prefetcher 120 treats the request as a virtual miss and calculates prefetch information for the instruction based on the request having resulted in a cache miss (304). If the information indicates that the data element had not been prefetched into the cache, the prefetcher 120 calculates prefetch information for the instruction based on the request having resulted in a cache hit (307). In this embodiment, the cache manager will generate a cache miss response for a cache request if data requested is stored in the cache and the cache manager determines that the data was prefetched into the cache. The cache prefetcher receives cache miss responses from the cache manager and prefetches data into the cache based on the distance between cache misses for an instruction. In an embodiment, the prefetcher 120 updates prefetch table 126 to indicate that a miss response was received whenever either an actual miss or a virtual miss response was received.
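The decision steps of FIG. 3 may be sketched, under stated assumptions, as a single function. The function name handle_request, the dictionary layouts for the cache and the prefetch table, and the choice to leave the table untouched on an ordinary hit are all assumptions made for illustration; the description does not specify what is calculated at step 307.

```python
from typing import Dict, Optional


def handle_request(ip: int, addr: int,
                   cache: Dict[int, bool],
                   table: Dict[int, Dict[str, Optional[int]]]) -> str:
    """Classify one cache request and update the (hypothetical) prefetch table.

    cache maps each cached address to its prefetch bit.
    table maps an IP to {"last_miss": address, "distance": miss distance}.
    """
    hit = addr in cache                  # step 303: did the request hit?
    prefetched = hit and cache[addr]     # steps 305-306: read the prefetch bit
    if hit and not prefetched:
        return "hit"                     # step 307: assumed to record no miss

    # Step 304: an actual miss and a virtual miss are handled the same way.
    entry = table.setdefault(ip, {"last_miss": None, "distance": None})
    if entry["last_miss"] is not None:
        entry["distance"] = addr - entry["last_miss"]
    entry["last_miss"] = addr
    return "virtual miss" if hit else "actual miss"
```

For example, with a cache holding a prefetched line at address 200, a miss at address 100 followed by a request for address 200 is reported as a virtual miss and produces a miss distance of 100 in the table.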

[0021] In the example above, the prefetcher may have detected an access pattern of every fifth address because a cache miss has been detected occurring at every fifth address (e.g., 0x0005, 0x0010, 0x0015, 0x0020, . . . ) for this instruction. Thus, the prefetcher will prefetch the address that is five addresses away from the last address accessed by that instruction (because that is the next expected miss). Once the data element at address 0x0025 has been prefetched into the cache, however, it will not cause an actual cache miss. If the prefetching scheme is based on a detected pattern of cache misses, the presence of the prefetched data element from address 0x0025 in the cache could cause the prefetcher to determine that the pattern has been interrupted (because the request for address 0x0025 did not cause an actual cache miss) even though the access pattern is actually still valid. Thus, the learned stride access pattern of 5 may become corrupted. According to embodiments of the invention disclosed in this application, the prefetcher will determine, based on the content of the prefetch bit for the cache line in question, that the request caused a virtual miss. Thus, the prefetcher will update the prefetch information (e.g., the miss distance) for the instruction as if the request generated an actual miss.
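The corruption of a stride of five can be traced with a short simulation. The sketch below is an illustrative model only: it uses decimal addresses 5, 10, 15, and so on for readability, treats each address as its own cache line, and assumes a prefetcher that prefetches one stride ahead as soon as any miss distance is known. The function names misses_seen and distances are hypothetical.

```python
from typing import List


def misses_seen(accesses: List[int], use_prefetch_bits: bool) -> List[int]:
    """Return the addresses that the prefetcher observes as misses."""
    cache = {}        # address -> prefetch bit (True if the line was prefetched)
    last_miss = None
    seen = []
    for addr in accesses:
        hit = addr in cache
        virtual_miss = use_prefetch_bits and hit and cache[addr]
        if hit and not virtual_miss:
            continue                     # ordinary hit: nothing is recorded
        seen.append(addr)
        cache[addr] = False              # demand fill, or bit reset on first hit
        if last_miss is not None:
            distance = addr - last_miss
            # Prefetch the next expected miss and mark the line as prefetched.
            cache.setdefault(addr + distance, True)
        last_miss = addr
    return seen


def distances(addrs: List[int]) -> List[int]:
    """Differences between consecutive observed misses."""
    return [b - a for a, b in zip(addrs, addrs[1:])]


accesses = [5, 10, 15, 20, 25, 30]
with_bits = distances(misses_seen(accesses, use_prefetch_bits=True))
without_bits = distances(misses_seen(accesses, use_prefetch_bits=False))
```

Under these assumptions, with_bits is [5, 5, 5, 5, 5], while without_bits is [5, 10, 5]: the hit to the prefetched address 15 hides one miss, so the observed distance jumps to 10 and the learned stride is disturbed.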

[0022] In a further embodiment, the prefetch bit management logic 261 prevents the calculation of a miss distance of zero for instructions that have a stride greater than zero but less than the size of a cache line. In this embodiment, whenever a request results in a virtual miss, the prefetch bit management logic 261 resets the prefetch bit associated with the cache line that contains the data requested. The next time that this data is requested from the cache, the cache will respond to the request by indicating that an actual hit has resulted, even though the data had been prefetched, because the prefetch bit will have been reset. If the stride of the instruction is less than the size of a cache line, the addresses requested by two or more instructions could fall in the same cache line. Thus, a virtual miss would be generated by the same cache line with a stride of zero every time these instructions hit the same cache line with a prefetch bit set. Clearing the prefetch bit after the first hit to the prefetched cache line prevents this case from occurring.
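The effect of resetting the prefetch bit can be illustrated with a sketch that assumes 32-byte cache lines and an instruction with an 8-byte stride. The function name virtual_miss_lines and the measurement of miss distance in units of cache lines are assumptions made for this example.

```python
from typing import Iterable, List

LINE_SIZE = 32  # bytes per cache line (assumed for this example)


def virtual_miss_lines(addresses: Iterable[int],
                       prefetched_lines: Iterable[int],
                       reset_on_hit: bool) -> List[int]:
    """Return the line number of every request that is seen as a virtual miss."""
    bits = {line: True for line in prefetched_lines}  # line -> prefetch bit
    result = []
    for addr in addresses:
        line = addr // LINE_SIZE
        if bits.get(line, False):
            result.append(line)
            if reset_on_hit:
                bits[line] = False  # later hits to this line are actual hits
    return result


# Addresses 64..88 fall in line 2; address 96 falls in line 3.
addresses = [64, 72, 80, 88, 96]
no_reset = virtual_miss_lines(addresses, [2, 3], reset_on_hit=False)
with_reset = virtual_miss_lines(addresses, [2, 3], reset_on_hit=True)
```

Without the reset, no_reset is [2, 2, 2, 2, 3], so three consecutive virtual misses are zero lines apart. With the reset, with_reset is [2, 3], and the only distance observed is one line.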

[0023] In a further embodiment, the cache manager 260 stores recency of use information for the plurality of cache lines and uses information from the prefetch bits to determine this recency of use information. In an embodiment, the recency of use logic 262 stores information in LRU array 250 indicating that a data array line has a status of least recently used whenever the data array line is updated with data that was prefetched into the cache. According to this embodiment, data that has been prefetched into the cache, but has not yet been used, may be selected first for eviction. The recency of use logic 262 stores information indicating that the data array line last read has a status of most recently used unless the data array line is associated with a prefetch bit that indicates data being stored in this data array line was prefetched into the cache. According to this embodiment, a cache line containing prefetched data that is hit a first time will not be changed to a status of most recently used. Thus, prefetched data that is hit only once may also be evicted first. The prefetch bit is cleared once the cache line is hit, and thus upon the second hit to the cache line the recency of use logic 262 will treat the cache line as if it were not prefetched and will change its status to most recently used. The above embodiments for reducing cache pollution use the same data structure (i.e., the prefetch bits) as is used to indicate that a cache line contains prefetched data. If data is prefetched into the cache that is not accessed or reused, this data will be replaced first.
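One possible reading of this replacement policy, for a single cache set, is sketched below. The class name CacheSet and its methods are hypothetical, recency is modeled as a simple ordered list rather than an LRU array, and the sketch assumes that a tag is filled only when it is not already present and read only when it is present.

```python
from typing import Dict, List


class CacheSet:
    """Hypothetical model of one set with prefetch-aware recency tracking."""

    def __init__(self, ways: int) -> None:
        self.ways = ways
        self.order: List[str] = []  # front = least recently used, back = most
        self.prefetch_bit: Dict[str, bool] = {}

    def fill(self, tag: str, prefetched: bool) -> None:
        if len(self.order) >= self.ways:
            victim = self.order.pop(0)      # evict the least recently used line
            del self.prefetch_bit[victim]
        if prefetched:
            self.order.insert(0, tag)       # prefetched data starts as LRU
        else:
            self.order.append(tag)          # demand-fetched data starts as MRU
        self.prefetch_bit[tag] = prefetched

    def read(self, tag: str) -> None:
        if self.prefetch_bit[tag]:
            # First hit to prefetched data: clear the bit, keep the LRU status.
            self.prefetch_bit[tag] = False
        else:
            self.order.remove(tag)
            self.order.append(tag)          # promote to most recently used
```

In this model, a prefetched line that is never read, or that is read only once, remains at the front of the list and is the first candidate for eviction; a second read moves it to the back.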

[0024] FIG. 4 is a partial block diagram of a computer system having prefetch bits according to another embodiment of the present invention. Similar to FIG. 1, FIG. 4 shows a computer system 400 that contains a processor 401 that is coupled to a RAM 440. Processor 401 contains a decoder 410 coupled to an execution unit 407. Processor 401 also contains a prefetcher 420 that is coupled to decoder 410 and execution unit 407. Computer system 400 contains a cache 430 that is coupled to processor 401 and to RAM 440. Unlike the processor 101 of FIG. 1, processor 401 also contains a read request buffer 470 that is coupled to prefetcher 420, cache 430, and RAM 440. In this embodiment, prefetch bits 475 are attached to read request buffer 470. Read request buffer 470 may be a cache fill buffer that starts the prefetch request to memory. When this embodiment is used, the prefetch bit may be associated with the cache line before the data is brought into the cache. If the same instruction hits the prefetch line when it is still in the request stage, then the stride continuity may be maintained and the new prefetch request may be issued while the old prefetch request is in progress.
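A minimal sketch of prefetch bits attached to a read request buffer follows. The class name ReadRequestBuffer and its methods are hypothetical, and the sketch assumes that a request to a line whose prefetch is still in flight is treated as a virtual miss.

```python
from typing import Dict


class ReadRequestBuffer:
    """Hypothetical model of in-flight read requests, each with a prefetch bit."""

    def __init__(self) -> None:
        self.pending: Dict[int, bool] = {}  # line address -> prefetch bit

    def issue(self, line_addr: int, prefetched: bool) -> None:
        # The prefetch bit is associated with the line before the data arrives.
        self.pending[line_addr] = prefetched

    def is_virtual_miss(self, line_addr: int) -> bool:
        """True if the line is in flight and was requested by the prefetcher."""
        return self.pending.get(line_addr, False)

    def complete(self, line_addr: int) -> bool:
        """Retire the request and return its prefetch bit for the cache fill."""
        return self.pending.pop(line_addr)
```

Because the bit exists while the request is pending, a request that arrives before the fill completes can still be counted as a virtual miss, which allows the next prefetch to be issued while the earlier one is in progress.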

[0025] Embodiments of the present invention relate to a prefetcher which prefetches data for an instruction based on an access pattern that has been determined and maintained for the instruction. The present invention maintains stride continuity by handling cache requests resulting in a cache hit to prefetched data as if a cache miss had occurred. Several embodiments of the present invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention. For example, any combination of one or more of the aspects described above may be used. In addition, the invention may be used with physical addresses or linear addresses. In addition, the invention may be used with prefetch schemes based on different types of access patterns, including those based on sequential, linear, or series patterns.

What is claimed is:
 1. A processor comprising: a cache having a plurality of lines to store data; a plurality of prefetch bits each associated with one of the cache lines; and a prefetcher to calculate prefetch data as though a cache miss occurred if a cache request results in a cache hit to a line that is associated with a set prefetch bit.
 2. The processor of claim 1, wherein the prefetch data includes miss distance information for instructions.
 3. The processor of claim 1, wherein the prefetch bits are stored in the cache.
 4. The processor of claim 1, wherein the prefetch manager has logic to reset a prefetch bit associated with a cache line whenever a cache request results in a cache hit to the cache line and the prefetch bit was set.
 5. The processor of claim 1 further comprising logic to store recency of use information for the plurality of cache lines which logic uses information from the prefetch bits to determine the recency of use information.
 6. A cache comprising: a data array having a plurality of lines to store data; and a plurality of prefetch bits each associated with one of the data array lines to indicate that data stored in the associated line was prefetched into the cache.
 7. The cache of claim 6, wherein the cache contains logic to reset a prefetch bit in response to a read from the data array line associated with the prefetch bit.
 8. The cache of claim 6, wherein the cache further comprises a least recently used (LRU) array, and wherein said plurality of prefetch bits are located in the LRU array.
 9. The cache of claim 6, wherein the cache has logic to store recency of use information associated with each data array line, and wherein said logic stores information indicating that a data array line has a status of least recently used whenever the data array line is updated with data that was prefetched into the cache.
 10. The cache of claim 9, wherein said logic stores information indicating that the data array line last read has a status of most recently used unless the data array line is associated with a prefetch bit that indicates data being stored in this data array line was prefetched into the cache.
 11. A processor comprising: a cache; an instruction decoder to decode instructions and to cause cache requests to be sent for data to be used by the instructions decoded; a cache manager to generate a cache miss response for a cache request if data requested is stored in the cache and the cache manager determines that the data was prefetched into the cache; and a cache prefetcher to receive cache miss responses from the cache manager and to prefetch data into the cache based on the distance between cache misses for an instruction.
 12. The processor of claim 11, wherein the processor further includes a plurality of prefetch bits that store information used by the prefetch manager to determine if data was prefetched into the cache.
 13. The processor of claim 12, wherein the cache contains a Read request buffer, and wherein the plurality of prefetch bits are attached to the Read request buffer.
 14. The processor of claim 12, wherein the processor contains logic to reset the prefetch bit associated with prefetched data in response to the first hit to the prefetched data, and wherein the cache manager will determine that data was not prefetched if the prefetch bit associated with said data is reset.
 15. The processor of claim 11, wherein the cache contains bits to store information about the status of each line in the cache, and wherein the cache contains logic to update the status of a cache line to least recently used whenever prefetched data is stored in the cache line and to update the status of said cache line to most recently used whenever the prefetched data is read a second time after the data is prefetched into the cache.
 16. A method of maintaining the continuity of prefetch information, the method comprising: decoding an instruction a first time; sending a first request to a cache for data to be used by said instruction; determining that the data requested in the first request is stored in a line in a cache; determining that a prefetch bit associated with said cache line indicates that the cache line stores data that was prefetched into the cache; and calculating prefetch information for said instruction, wherein the prefetch information is calculated based on the first request having resulted in a cache miss.
 17. The method of claim 16, wherein the method further comprises resetting the prefetch bit associated with the cache line.
 18. The method of claim 16, wherein the calculation of prefetch information for an instruction comprises calculating miss distance information for the instruction.
 19. The method of claim 16, further comprising: decoding a second instruction which is to use said data; sending a second request to the cache for said data; determining that the data requested in the second request is stored in a line in the cache; determining that the prefetch bit associated with said cache line indicates that the cache line stores data that was not prefetched into the cache; calculating prefetch information for the instruction, wherein the prefetch information is calculated based on the second request having resulted in a cache hit; and updating the status information corresponding to the cache line to indicate that the cache line was most recently used.
 20. A processor comprising: a cache having a plurality of cache lines; a means for prefetching data into one of the cache lines; and a means for indicating that a cache line contains prefetched data.
 21. The processor of claim 20, further comprising a means for determining that a virtual miss has occurred in response to a request for data sent to the cache whenever the data is stored in a cache line and the means for indicating indicates that the cache line contains prefetched data.
 22. The processor of claim 20, wherein the means for prefetching updates a prefetch table to indicate that a miss response was received whenever a response was received for an actual miss or a virtual miss.
 23. The processor of claim 20 further comprising a means for preventing the calculation of a miss distance of zero for instructions that have a miss distance that is greater than zero but less than the size of a cache line.
 24. The processor of claim 20, wherein the means for indicating that a cache line contains prefetched data stores a data structure that is used to determine whether a cache line contains prefetched data, and wherein the processor further comprises a means for reducing cache pollution that uses the same data structure as said means for indicating.